This course will help me get familiar with R, RStudio and GitHub. I hope to learn some basics of data science and apply this knowledge to some of the data gathered during my PhD.
Everything went smoothly with the first assignment. Installation of the three programs went fine, and everything seems to be running. It is my first time working with GitHub and RStudio. I have been through some of the R Markdown syntax, and the language is quite strange to me. I didn't quite understand what the Personal Access Token adds in terms of the connection between RStudio and GitHub, given that we are already working with the shared IODS repository.
date()
## [1] "Mon Dec 4 17:50:15 2023"
A full data set was provided for this exercise (https://www.mv.helsinki.fi/home/kvehkala/JYTmooc/JYTOPKYS3-data.txt). By reading the txt file and exploring the dimensions (dim()) and structure (str()) of the data, we can see that the raw data contains 183 observations and 64 variables.
We want to group related questions into combination variables. The variables beginning with "D" correspond to the Deep questions, variables beginning with "SU" correspond to Surface questions, and variables beginning with "ST" correspond to Strategic questions.
By combining multiple questions into combination variables, we now have 7 variables in total. From the 183 observations, we have excluded 17 observations where the exam points were marked as zero. So, in summary, we now have 166 observations and 7 variables.
This was done inside the R script "create_learning_2014". After arranging the data, I created the csv file inside the folder 'data', which will be used for the analysis.
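The core of that wrangling can be sketched on a toy data frame (the question-column names D03, D06, SU02, SU05, ST01 and ST04 below are placeholders for illustration, not the real question lists from the codebook):

```r
library(dplyr)

# Toy stand-in for the raw survey data: two "deep", two "surface" and
# two "strategic" question columns plus the exam points.
raw <- data.frame(
  D03 = c(4, 2), D06 = c(5, 3),
  SU02 = c(2, 4), SU05 = c(3, 3),
  ST01 = c(4, 1), ST04 = c(3, 2),
  Points = c(25, 0)
)

deep_cols <- c("D03", "D06")
surf_cols <- c("SU02", "SU05")
stra_cols <- c("ST01", "ST04")

learning <- raw %>%
  # each combination variable is the mean of its question columns
  mutate(deep = rowMeans(across(all_of(deep_cols))),
         surf = rowMeans(across(all_of(surf_cols))),
         stra = rowMeans(across(all_of(stra_cols)))) %>%
  # exclude observations whose exam points are zero
  filter(Points > 0)
```

On the real data this same logic leaves the 166 observations and 7 variables mentioned above.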
library(dplyr)
library(GGally)
## Warning: package 'GGally' was built under R version 4.3.2
library(ggplot2)
library(tidyverse)
## Warning: package 'readr' was built under R version 4.3.2
#reading our csv file and exploring its content by looking at the dimensions and structure.
output_learning2014 <- read.csv("data/output_learning2014.csv",sep = ",", header = T)
#dimensions of our data frame
dim(output_learning2014)
## [1] 166 7
#structure
str(output_learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
#quick overview of the structure and values
head(output_learning2014)
## gender Age attitude deep stra surf Points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
# create a more advanced plot matrix with ggpairs()
#graphical overview of the data, showing summaries of the variables.
p <- ggpairs(output_learning2014, mapping = aes(col=gender, alpha=0.3),
lower = list(combo = wrap("facethist", bins = 20)))
#color aesthetic (col) is mapped to the "gender" variable, and alpha is set to 0.3 for transparency.
p
When we use not just pairs() but the libraries GGally and ggplot2 together, we get a more detailed look at the variables, their distributions and their relationships.
ggpairs() gives us a graphical overview of our data together with summaries of the variables. From the matrix chart we can see the distributions of 'Age', 'attitude', 'deep', 'stra', 'surf' and 'Points'. For example, for 'Age' we can observe that the peak of the distribution is around the age of 20 for females.
The matrix also shows the correlations for every pair of variables, from 'Age' vs 'attitude' through to 'surf' vs 'Points'.
Now we create the model for the regression asked for in the exercise. I have chosen 'attitude', 'deep' and 'stra' as explanatory variables, and have fitted a regression model where exam points ('Points') is the target variable.
#Points will be the target variable.
#explanatory variables will be attitude, deep and stra
#the regression model is the following, taking into consideration the comments above
qplot(attitude + deep + stra, Points , data = output_learning2014) + geom_smooth(method = "lm")
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# fitting a linear model
model1 <- lm(Points ~ attitude + deep + stra, data = output_learning2014)
# printing my summary of the model
summary(model1)
##
## Call:
## lm(formula = Points ~ attitude + deep + stra, data = output_learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.5239 -3.4276 0.5474 3.8220 11.5112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.3915 3.4077 3.343 0.00103 **
## attitude 3.5254 0.5683 6.203 4.44e-09 ***
## deep -0.7492 0.7507 -0.998 0.31974
## stra 0.9621 0.5367 1.793 0.07489 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared: 0.2097, Adjusted R-squared: 0.195
## F-statistic: 14.33 on 3 and 162 DF, p-value: 2.521e-08
From the R console we can look at the Estimate values and understand what happens to the target variable (whether it goes up or down, and by how much). The interpretation runs from the explanatory variables to the target: a one-unit increase in 'attitude' is associated with an increase of 3.5254 in 'Points', a one-unit increase in 'deep' with a decrease of 0.7492, and a one-unit increase in 'stra' with an increase of 0.9621.
We also have the Multiple R-squared of 0.2097, which means that about 21% of the variation in the dependent variable is explained by these predictors (about 19.5% after adjusting for the number of predictors).
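These numbers can also be extracted programmatically from the fitted object; a minimal sketch on the built-in mtcars data rather than the course data:

```r
# Fit a comparable multiple regression and pull out the quantities
# discussed above: coefficient estimates and (adjusted) R-squared.
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

coef(s)[, "Estimate"]  # per-unit association of each predictor with mpg
s$r.squared            # share of variance explained by the predictors
s$adj.r.squared        # the same, penalized for the number of predictors
```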
I have removed the explanatory variable 'deep' from the code and fitted the model again.
#explanatory variables will be attitude and stra
#Points will be the target variable.
#the regression model is the following, taking into consideration the comments above
qplot(attitude + stra, Points , data = output_learning2014) + geom_smooth(method = "lm")
# fitting a linear model
model2 <- lm(Points ~ attitude + stra, data = output_learning2014)
# print out again my summary of the model
summary(model2)
##
## Call:
## lm(formula = Points ~ attitude + stra, data = output_learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
I have set up a 2x2 grid for the next plots using the function par().
I have produced diagnostic plots using the plot() function with the which argument to specify which diagnostic plots I want, according to the assignment. In this case, which = c(1, 2, 5) produces the Residuals vs Fitted values, Normal Q-Q, and Residuals vs Leverage plots.
For the first one, the residuals should be independent of each other: there should be no systematic pattern in the residuals across observations. In our case, we see that the residuals are randomly and evenly distributed around 0, with essentially no pattern. Overall, the spread of the residuals is roughly equal across the range of fitted values, which suggests the variance is uniform.
A Normal Quantile-Quantile (Q-Q) plot of the residuals can be used to assess normality. The points on the Q-Q plot don't deviate significantly from the diagonal line, which suggests that the residuals are approximately normally distributed.
The Residuals vs Leverage plot shows how the spread of standardized residuals changes as a function of leverage; ideally it shouldn't change at all. The model minimizes the distance of the line to all the points, so points far from the line can sometimes have a large influence on the fit, and deleting them can change the model considerably. We can observe that most points are concentrated on the left part of the plot, and none of them cross the Cook's distance boundary, which suggests that no single observation exerts a high influence on the regression coefficients.
#my grid
par(mfrow = c(2,2))
#draw of the diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage.
plot(model2, which=c(1,2,5))
This dataset concerns student achievement in secondary education at two Portuguese schools. The data include student grades, demographic, social and school-related features, and were collected using school reports and questionnaires. Two datasets are provided regarding performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In the R script 'create_alc.R' I have joined the two data sets using all variables other than "failures", "paid", "absences", "G1", "G2" and "G3" as (student) identifiers. At the end, our data set has 370 observations and 35 variables.
Because we will use it in the analysis, one important column to understand is the logical column 'high_use', which was created in the wrangling part of the exercise. This column is TRUE when the average of the answers related to weekday and weekend alcohol consumption is greater than 2, and FALSE otherwise.
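In 'create_alc.R' the column was defined roughly like this (sketched here on a toy data frame containing only the weekday and weekend alcohol columns):

```r
library(dplyr)

# Toy data: weekday (Dalc) and weekend (Walc) consumption answers.
toy <- data.frame(Dalc = c(1, 3), Walc = c(1, 4))

toy <- toy %>%
  mutate(alc_use  = (Dalc + Walc) / 2,  # average of the two answers
         high_use = alc_use > 2)        # TRUE when the average exceeds 2
```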
For the exercise I will choose some of the variables, which will be explained below.
library(tidyr)
library(readr)
library(dplyr)
library(ggplot2)
alc <- read.csv("data/alc.csv",sep = ",", header = T)
variables_names <- colnames(alc)
variables_names
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
dim(alc)
## [1] 370 35
str(alc)
## 'data.frame': 370 obs. of 35 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr "U" "U" "U" "U" ...
## $ famsize : chr "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr "A" "T" "T" "T" ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr "teacher" "other" "other" "services" ...
## $ reason : chr "course" "course" "other" "home" ...
## $ guardian : chr "mother" "father" "mother" "mother" ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ schoolsup : chr "yes" "no" "yes" "no" ...
## $ famsup : chr "no" "yes" "no" "yes" ...
## $ activities: chr "no" "no" "no" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "no" "yes" "yes" "yes" ...
## $ romantic : chr "no" "no" "no" "yes" ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ failures : int 0 0 2 0 0 0 0 0 0 0 ...
## $ paid : chr "no" "no" "yes" "yes" ...
## $ absences : int 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : int 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : int 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : int 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
The variables of interest to us are 'goout', going out with friends (numeric: from 1 - very low to 5 - very high); 'failures', the number of past class failures (numeric: n if 1<=n<3, else 4); 'absences', the number of school absences (numeric: from 0 to 93); and 'G3', the final grade (numeric: from 0 to 20, output target).
For each variable we want to analyse its relationship with alcohol consumption, so for each one we can formulate an a priori hypothesis. Higher levels of going out with friends ('goout') might be associated with higher alcohol consumption, as students who go out more frequently may drink more. Students with a higher number of past class failures ('failures') may exhibit higher alcohol consumption. A higher number of school absences ('absences') may be associated with higher alcohol consumption, since frequently absent students might have more opportunities for social activities, including drinking. Finally, there might be an inverse relationship between the final grade ('G3') and alcohol consumption: students with lower grades might be more likely to engage in higher alcohol consumption.
I have analysed the data for our 4 variables using cross-tabulations, bar plots and box plots. The results turned out to be somewhat different from the hypotheses. We can conclude that high alcohol use is more common when students go out more often. However, most of the students with high alcohol use are the ones with zero class failures, and the count of high alcohol consumption decreases with the number of failures. Students with more class absences show fewer cases of high alcohol use, and students with middle-range grades are the ones with the most high alcohol consumption.
alc %>% group_by(failures) %>% summarise(count = n())
## # A tibble: 4 × 2
## failures count
## <int> <int>
## 1 0 325
## 2 1 24
## 3 2 17
## 4 3 4
alc %>% group_by(absences) %>% summarise(count = n())
## # A tibble: 26 × 2
## absences count
## <int> <int>
## 1 0 63
## 2 1 50
## 3 2 56
## 4 3 38
## 5 4 35
## 6 5 22
## 7 6 21
## 8 7 12
## 9 8 20
## 10 9 11
## # ℹ 16 more rows
alc %>% group_by(G3) %>% summarise(count = n())
## # A tibble: 18 × 2
## G3 count
## <int> <int>
## 1 0 2
## 2 2 1
## 3 3 1
## 4 4 6
## 5 5 5
## 6 6 21
## 7 7 5
## 8 8 26
## 9 9 7
## 10 10 63
## 11 11 25
## 12 12 78
## 13 13 25
## 14 14 38
## 15 15 16
## 16 16 37
## 17 17 7
## 18 18 7
alc %>% group_by(goout) %>% summarise(count = n())
## # A tibble: 5 × 2
## goout count
## <int> <int>
## 1 1 22
## 2 2 97
## 3 3 120
## 4 4 78
## 5 5 53
#cross-tabulations
# Cross-tabulation of 'goout' with 'alc_use'
cross_tab_goout <- table(alc$goout, alc$high_use)
print("Cross-tabulation of 'goout' with 'high_use'")
## [1] "Cross-tabulation of 'goout' with 'high_use'"
cross_tab_goout
##
## FALSE TRUE
## 1 19 3
## 2 82 15
## 3 97 23
## 4 40 38
## 5 21 32
# Cross-tabulation of 'failures' with 'alc_use'
cross_tab_failures <- table(alc$failures, alc$high_use)
print("Cross-tabulation of 'failures' with 'high_use'")
## [1] "Cross-tabulation of 'failures' with 'high_use'"
cross_tab_failures
##
## FALSE TRUE
## 0 238 87
## 1 12 12
## 2 8 9
## 3 1 3
# Cross-tabulation of 'absences' with 'alc_use'
cross_tab_absences <- table(alc$absences, alc$high_use)
print("Cross-tabulation of 'absences' with 'high_use'")
## [1] "Cross-tabulation of 'absences' with 'high_use'"
cross_tab_absences
##
## FALSE TRUE
## 0 50 13
## 1 37 13
## 2 40 16
## 3 31 7
## 4 24 11
## 5 16 6
## 6 16 5
## 7 9 3
## 8 14 6
## 9 5 6
## 10 5 2
## 11 2 3
## 12 3 4
## 13 1 1
## 14 1 6
## 16 0 1
## 17 0 1
## 18 1 1
## 19 0 1
## 20 2 0
## 21 1 1
## 26 0 1
## 27 0 1
## 29 0 1
## 44 0 1
## 45 1 0
# Cross-tabulation of 'G3' with 'alc_use'
cross_tab_G3 <- table(alc$G3, alc$high_use)
print("Cross-tabulation of 'G3' with 'high_use'")
## [1] "Cross-tabulation of 'G3' with 'high_use'"
cross_tab_G3
##
## FALSE TRUE
## 0 1 1
## 2 0 1
## 3 1 0
## 4 5 1
## 5 3 2
## 6 15 6
## 7 3 2
## 8 18 8
## 9 3 4
## 10 38 25
## 11 16 9
## 12 51 27
## 13 18 7
## 14 31 7
## 15 15 1
## 16 29 8
## 17 5 2
## 18 7 0
#bar plots
# Bar plot for 'goout' and its relationship with 'alc_use'
ggplot(data=alc, aes(x = goout, fill = high_use)) +
geom_bar() +
labs(x = "Go Out with Friends", y = "Count", fill = "Alcohol Use")
# Bar plot for 'failures' and its relationship with 'alc_use'
ggplot(alc, aes(x = failures, fill = high_use)) +
geom_bar() +
labs(x = "Number of Failures", y = "Count", fill = "Alcohol Use")
#for the last two plots, I've tried a different way of plotting the bars
# Bar plot for 'absences' and its relationship with 'alc_use'
ggplot(alc, aes(x = absences, fill = high_use)) +
geom_bar(position = "dodge", stat = "count") +
labs(x = "Number of Absences", y = "Count", fill = "Alcohol Use")
# Bar plot for 'G3' and its relationship with 'alc_use'
ggplot(alc, aes(x = G3, fill = high_use)) +
geom_bar(position = "dodge", stat = "count") +
labs(x = "Final Grade (G3)", y = "Count", fill = "Alcohol Use")
#boxplots
g_failures <- ggplot(alc, aes(x = high_use, y = failures, col = sex))
g_failures + geom_boxplot() + ylab("failures")
g_g3 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))
g_g3 + geom_boxplot() + ylab("grade")
g_absences <- ggplot(alc, aes(x = high_use, y = absences, col = sex))
g_absences + geom_boxplot() + ylab("absences")
g_goout <- ggplot(alc, aes(x = high_use, y = goout, col = sex))
g_goout + geom_boxplot() + ylab("going_out")
I've used logistic regression to explore the relationship between my chosen variables and high/low alcohol consumption as the target variable.
For the variable 'failures', each one-unit increase in the number of failures increases the log-odds of high alcohol consumption by 0.52078. The p-value (0.028128) is less than 0.05, indicating that the coefficient for failures is statistically significant.
For absences, each additional absence increases the log-odds of high alcohol consumption by 0.07373. The p-value is also less than 0.05 (0.000958), indicating that the coefficient for absences is statistically significant.
For the final grade (G3), each one-unit increase in G3 decreases the log-odds by 0.01612. Here the p-value (0.699253) is greater than 0.05, indicating that the coefficient for G3 is not statistically significant; G3 is therefore not contributing significantly to the model.
For going out with friends, each one-unit increase in 'goout' increases the log-odds by 0.69791. The z-value (5.874) corresponds to a very small p-value (4.25e-09), which is highly statistically significant.
The second part of the code fits the logistic regression model, extracts the coefficients, computes the odds ratios, calculates confidence intervals, and then prints the odds ratios with their confidence intervals. The confint() function obtains the confidence intervals for the coefficients, and the exp() function exponentiates the values; cbind() combines the odds ratios and confidence intervals into a single matrix for printing.
For failures, each one-unit increase in the number of failures multiplies the odds of high alcohol use by 1.6833396 (in sum, more failures are associated with higher odds of high alcohol use). The 95% confidence interval for the odds ratio is 1.062542815 - 2.7040414.
For absences, each additional absence multiplies the odds of high alcohol use by 1.0765213, a slight increase; the 95% confidence interval is 1.031833703 - 1.1275904. For the final grade the odds ratio is 0.9840135, with a 95% confidence interval of 0.906930755 - 1.0685440. For going out, the odds ratio is 2.0095429, with a 95% confidence interval of 1.600845280 - 2.5531997.
Regarding the intercept, the odds of high alcohol use are 0.0326675 when all predictors are zero, indicating a low baseline likelihood of high alcohol use; the 95% confidence interval is 0.007953009 - 0.1220042.
# Explore the dataset
str(alc)
## 'data.frame': 370 obs. of 35 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr "U" "U" "U" "U" ...
## $ famsize : chr "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr "A" "T" "T" "T" ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr "teacher" "other" "other" "services" ...
## $ reason : chr "course" "course" "other" "home" ...
## $ guardian : chr "mother" "father" "mother" "mother" ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ schoolsup : chr "yes" "no" "yes" "no" ...
## $ famsup : chr "no" "yes" "no" "yes" ...
## $ activities: chr "no" "no" "no" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "no" "yes" "yes" "yes" ...
## $ romantic : chr "no" "no" "no" "yes" ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ failures : int 0 0 2 0 0 0 0 0 0 0 ...
## $ paid : chr "no" "no" "yes" "yes" ...
## $ absences : int 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : int 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : int 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : int 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
# Fit the logistic regression model
model <- glm(high_use ~ failures + absences + G3 + goout,
data = alc,
family = "binomial")
summary(model)
##
## Call:
## glm(formula = high_use ~ failures + absences + G3 + goout, family = "binomial",
## data = alc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.42137 0.69467 -4.925 8.43e-07 ***
## failures 0.52078 0.23720 2.195 0.028128 *
## absences 0.07373 0.02233 3.303 0.000958 ***
## G3 -0.01612 0.04171 -0.386 0.699253
## goout 0.69791 0.11881 5.874 4.25e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 383.91 on 365 degrees of freedom
## AIC: 393.91
##
## Number of Fisher Scoring iterations: 4
coef(model)
## (Intercept) failures absences G3 goout
## -3.42137467 0.52077970 0.07373480 -0.01611567 0.69790726
# compute odds ratios (OR)
OR <- coef(model) %>% exp
# compute confidence intervals (CI)
CI <- confint(model) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
result <- cbind(OR, CI)
print(result)
## OR 2.5 % 97.5 %
## (Intercept) 0.0326675 0.007953009 0.1220042
## failures 1.6833396 1.062542815 2.7040414
## absences 1.0765213 1.031833703 1.1275904
## G3 0.9840135 0.906930755 1.0685440
## goout 2.0095429 1.600845280 2.5531997
The next part provides insights into the performance of the logistic regression model, including accuracy, proportions, and inaccuracy metrics.
The scatter plot visually represents the predicted probabilities against the actual values. The points are colored based on whether the prediction is above (TRUE) or below (FALSE) the threshold of 0.5.
I computed a 2x2 confusion matrix to compare actual values and predictions. The confusion matrix provides counts of true positives, true negatives, false positives, and false negatives. For example, there are 252 true negatives (actual: FALSE, predicted: FALSE) and 33 true positives (actual: TRUE, predicted: TRUE).
Adding margins to the proportional confusion matrix gives us the row and column sums, providing an overview of the distribution of actual and predicted values. It shows that 70% of the observations are actually FALSE and 30% are actually TRUE, while about 89% are predicted FALSE and 11% are predicted TRUE.
I've defined a loss function to compute the average number of wrong predictions. If we predict all observations as 0 (using a threshold of 0.5), the average inaccuracy is 0.3 or 30%; if we predict all observations as 1, the average inaccuracy is 0.7 or 70%.
I've adjusted the code to give the prob argument the prediction probabilities in alc, which shows that the average inaccuracy is about 22.97% when using the actual predicted probabilities from the model.
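As a sanity check, the accuracy and inaccuracy can be computed directly from the 2x2 confusion-matrix counts reported below:

```r
# Confusion-matrix counts (rows = actual, columns = predicted).
cm <- matrix(c(252, 78, 7, 33), nrow = 2,
             dimnames = list(actual = c("FALSE", "TRUE"),
                             predicted = c("FALSE", "TRUE")))

accuracy   <- sum(diag(cm)) / sum(cm)  # (TN + TP) / total observations
inaccuracy <- 1 - accuracy             # share of wrong predictions
```

Here inaccuracy works out to 85/370, about 0.2297, matching the loss-function result.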
m <- glm(high_use ~ sex + failures + absences, data = alc, family = "binomial")
alc <- mutate(alc, probability = predict(m, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use))
# add the aesthetic element col = prediction and draw the plot again
g + geom_point(aes(col = prediction))
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 252 7
## TRUE 78 33
# adjust the code: use %>% to apply the prop.table() function on the output of table()
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table()
## prediction
## high_use FALSE TRUE
## FALSE 0.68108108 0.01891892
## TRUE 0.21081081 0.08918919
# adjust the code: use %>% to apply the addmargins() function on the output of prop.table()
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.68108108 0.01891892 0.70000000
## TRUE 0.21081081 0.08918919 0.30000000
## Sum 0.89189189 0.10810811 1.00000000
#inaccuracy
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = 0)
## [1] 0.3
# adjust the code: change the prob argument in the loss function to prob = 1
loss_func(class = alc$high_use, prob = 1)
## [1] 0.7
# adjust the code again: change the prob argument by giving it the prediction probabilities in alc (the column probability)
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2297297
Bonus and Super Bonus
Cross-validation provides a more robust estimate of a model's performance by assessing its generalization to unseen data. The inaccuracy values give an idea of how well the model is performing in terms of prediction errors; lower values are generally desirable. The training-set result [0.2297297] means the average inaccuracy (share of wrong predictions) on the training data is approximately 22.97%, while leave-one-out cross-validation (K equal to the number of observations) gives approximately 24.05%: on average, the model incorrectly predicts the outcome for about 24.05% of the observations when each observation is left out once.
In 10-fold cross-validation, the average inaccuracy on the testing data is approximately 24.05%. This means that, on average, the model incorrectly predicts the outcome for about 24.05% of the observations when the dataset is split into 10 folds, and the model is trained and tested on different subsets in each iteration.
The results suggest that the model’s performance is relatively consistent between leave-one-out cross-validation and 10-fold cross-validation. Both estimates of average inaccuracy are in a similar range, indicating that the model is making incorrect predictions for around 22-24% of the observations.
library(boot)
## Warning: package 'boot' was built under R version 4.3.2
m <- glm(high_use ~ sex + failures + absences, data = alc, family = "binomial")
alc <- mutate(alc, probability = predict(m, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2297297
# K-fold cross-validation
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = nrow(alc))
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2405405
# 10-fold cross validation.
cv_10fold <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv_10fold$delta[1]
## [1] 0.2405405
# Fit the initial logistic regression model with a high number of predictors
full_model <- glm(high_use ~ .,
data = alc,
family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Extract the initial set of predictors
initial_predictors <- c("failures", "absences", "G3", "goout", "sex")
# Create a list to store the results
cv_results_list <- list()
# Perform cross-validation with different sets of predictors
for (i in length(initial_predictors):1) {
current_predictors <- initial_predictors[1:i]
# Fit the logistic regression model with the current set of predictors
current_model <- glm(high_use ~ .,
data = select(alc, all_of(c(current_predictors, "high_use"))),
family = "binomial")
# Perform K-fold cross-validation
cv_results <- cv.glm(data = select(alc, all_of(c(current_predictors, "high_use"))),
cost = loss_func, glmfit = current_model, K = nrow(alc))
# Store the results
cv_results_list[[i]] <- cv_results
}
# Extract training and testing errors
training_errors <- sapply(cv_results_list, function(x) x$delta[1])
testing_errors <- sapply(cv_results_list, function(x) x$delta[2])
# Create a data frame for plotting
results_df <- data.frame(
Predictors = length(initial_predictors):1,
Training_Error = training_errors,
Testing_Error = testing_errors
)
# Plot the trends of both training and testing errors by the number of predictors
ggplot(results_df, aes(x = Predictors)) +
geom_line(aes(y = Training_Error, color = "Training Error")) +
geom_line(aes(y = Testing_Error, color = "Testing Error")) +
scale_color_manual(values = c("Training Error" = "blue", "Testing Error" = "red")) +
labs(x = "Number of Predictors", y = "Error Rate", title = "Cross-Validation Error by Number of Predictors") +
theme_minimal()
Analysis Exercise
I've loaded the data from the MASS package. The exercise didn't mention the dataset by name, so I assumed we would be using the Boston dataset from MASS, as we did in Exercise 4.
I have explored the structure and dimensions of the data: the dataset has 506 observations and 14 variables. I used corrplot to visualize the correlations between the variables of the Boston dataset. The 14 variables are described as follows:
library(tidyr)
library(readr)
library(dplyr)
library(ggplot2)
library(GGally)
library(MASS)
## Warning: package 'MASS' was built under R version 4.3.2
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.2
# load the data from the MASS package
data("Boston")
# explore the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# check whether pairs() and ggpairs() reveal any useful patterns
pairs(Boston)
p_boston <- ggpairs(Boston, mapping = aes(alpha=0.3),
lower = list(combo = wrap("facethist", bins = 20)))
p_boston
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston)
# visualize the correlation matrix
corrplot(cor_matrix, method="circle")
print(cor_matrix)
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## rm age dis rad tax ptratio
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431 0.2899456
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018 0.3832476
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320 0.1889327
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783 -0.3555015
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559 0.2615150
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819 0.4647412
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000 0.4608530
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304 1.0000000
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341 0.3740443
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593 -0.5077867
## black lstat medv
## crim -0.38506394 0.4556215 -0.3883046
## zn 0.17552032 -0.4129946 0.3604453
## indus -0.35697654 0.6037997 -0.4837252
## chas 0.04878848 -0.0539293 0.1752602
## nox -0.38005064 0.5908789 -0.4273208
## rm 0.12806864 -0.6138083 0.6953599
## age -0.27353398 0.6023385 -0.3769546
## dis 0.29151167 -0.4969958 0.2499287
## rad -0.44441282 0.4886763 -0.3816262
## tax -0.44180801 0.5439934 -0.4685359
## ptratio -0.17738330 0.3740443 -0.5077867
## black 1.00000000 -0.3660869 0.3334608
## lstat -0.36608690 1.0000000 -0.7376627
## medv 0.33346082 -0.7376627 1.0000000
By examining the correlation matrix and the corrplot, we can gain insight into the relationships between our variables. Correlations close to 1 indicate a strong positive relationship, correlations close to -1 indicate a strong negative relationship, and values near 0 suggest little or no linear relationship.
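As a quick sketch (not part of the exercise), the strongest pairs can also be pulled out of a correlation matrix programmatically; the built-in mtcars data stands in for Boston here:

```r
# List the strongest pairwise correlations from a correlation matrix.
cm <- cor(mtcars)
cm[upper.tri(cm, diag = TRUE)] <- NA          # keep each pair only once
pairs_df <- na.omit(as.data.frame(as.table(cm)))  # columns: Var1, Var2, Freq
# sort by absolute correlation, strongest first
head(pairs_df[order(-abs(pairs_df$Freq)), ], 3)
```

With 11 variables this yields 55 unique pairs; sorting by `abs(Freq)` surfaces the strongest relationships regardless of sign.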
# calculate the correlation matrix and round it using the pipe (%>%)
cor_matrix <- cor(Boston) %>% round(2)
# print the rounded correlation matrix
print(cor_matrix)
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
# visualize the rounded correlation matrix with corrplot
library(corrplot)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
The Boston data contains only numerical values, so we can use the function ‘scale()’ to standardize the whole dataset. After standardization each variable has a mean of 0 and a standard deviation of 1, so the minimum, maximum, and quartile values are now on a common scale. I then created a categorical variable ‘crime’ from the continuous crime rate, cut at its quantiles, and split the dataset into training and testing sets. The crime rate falls into four groups: 127 observations in [-0.419, -0.411], 126 in (-0.411, -0.39], 126 in (-0.39, 0.00739], and 127 in (0.00739, 9.92].
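What `scale()` does can be checked by hand: column-wise it subtracts the mean and divides by the standard deviation. A minimal sketch on one column of the Boston data:

```r
# scale() is column-wise (x - mean(x)) / sd(x); verify on the crim column.
x <- MASS::Boston$crim
manual <- (x - mean(x)) / sd(x)
all.equal(as.numeric(scale(x)), manual)  # TRUE
```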
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# summary of the scaled crime rate
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE)
# look at the table of the new factor crime
table(crime)
## crime
## [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
## 127 126 126 127
# adjust the cut() function with labels
labels <- c("low", "med_low", "med_high", "high")
crime <- cut(boston_scaled$crim, breaks = bins, labels = labels, include.lowest = TRUE)
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
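One caveat (my own note, not from the exercise): `sample()` draws different rows on every run, so the 80/20 split above is not reproducible unless a seed is fixed first. A small sketch:

```r
# Without set.seed() the sampled indices change on every run;
# fixing the seed makes the split reproducible.
set.seed(2023)                  # the seed value itself is arbitrary
ind1 <- sample(10, size = 8)
set.seed(2023)
ind2 <- sample(10, size = 8)
identical(ind1, ind2)  # TRUE
```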
The LDA model has been trained on the training data and shows how the predictor variables contribute to the classification of the crime groups. LD1 explains the majority of the variance. The confusion matrix indicates how well the model performs on the test dataset, counting correct and incorrect predictions for each crime group.
The probabilities of each group occurring in the training data, before considering any predictor, can be read from the prior probabilities of groups: roughly 25.7% for ‘low’, 24.0% for ‘med_low’, 25.7% for ‘med_high’ and 24.5% for ‘high’.
The group means are the mean standardized values of each predictor variable within each crime group. For example, in the “low” crime group the average value of zn is approximately 1.06. The coefficients of linear discriminants are the loadings of each predictor variable on the linear discriminant functions (LD1, LD2, LD3); a positive coefficient means a positive association with the corresponding discriminant function.
The proportion of trace shows the share of the between-group variance explained by each discriminant function. In our case, LD1 explains approximately 94.1% of the variance, LD2 explains 4.4% and LD3 explains 1.5%.
The confusion matrix compares the predicted classes (low, med_low, med_high, high) with the correct classes in the test dataset. The diagonal represents correct predictions, and the off-diagonal elements represent misclassifications.
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2574257 0.2400990 0.2574257 0.2450495
##
## Group means:
## zn indus chas nox rm age
## low 1.0619025 -0.8622496 -0.08304540 -0.8933813 0.4431922 -0.9430255
## med_low -0.1475383 -0.2650784 -0.02879709 -0.5552888 -0.1295236 -0.3435218
## med_high -0.3619070 0.1409927 0.14409500 0.3700615 0.1983658 0.4407578
## high -0.4872402 1.0171737 -0.07348562 1.0390781 -0.4073487 0.7955485
## dis rad tax ptratio black lstat
## low 0.9541974 -0.6848158 -0.7050298 -0.4329275 0.3762103 -0.7828757
## med_low 0.3386274 -0.5556360 -0.4732051 -0.0937466 0.3502584 -0.1252047
## med_high -0.3638743 -0.4087420 -0.3235799 -0.3454319 0.1006397 -0.0601055
## high -0.8444553 1.6375616 1.5136504 0.7801170 -0.7774242 0.8442379
## medv
## low 0.49909956
## med_low 0.01952406
## med_high 0.22257021
## high -0.64155880
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.084115881 0.787950120 -0.96084269
## indus 0.004168709 -0.056627323 0.11242349
## chas -0.113950283 -0.001832699 0.05713375
## nox 0.412168618 -0.626804825 -1.33207384
## rm -0.115587892 -0.098175872 -0.16020509
## age 0.236643664 -0.374889506 -0.30938815
## dis -0.107526518 -0.113995434 0.03439920
## rad 3.221522895 0.868293909 -0.05866800
## tax -0.066284061 0.001516003 0.63290660
## ptratio 0.118493874 0.055967426 -0.29919594
## black -0.133616744 0.019093775 0.17910746
## lstat 0.156302596 -0.220116594 0.59543759
## medv 0.176169488 -0.390394304 -0.03008351
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9406 0.0442 0.0152
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 9 11 3 0
## med_low 9 13 7 0
## med_high 0 7 14 1
## high 0 0 0 28
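From a cross-tabulation like the one above, the overall accuracy is the sum of the diagonal divided by the total count. A sketch using the values from my rendered table (they change with each random split):

```r
# Confusion matrix copied from the rendered output above
# (rows = correct classes, columns = predicted classes).
cm <- matrix(c(9,  9,  0,  0,    # predicted low
               11, 13, 7,  0,    # predicted med_low
               3,  7,  14, 0,    # predicted med_high
               0,  0,  1,  28),  # predicted high
             nrow = 4)
sum(diag(cm)) / sum(cm)  # overall accuracy, about 0.63
```

Here 64 of the 102 test observations are classified correctly; note that the ‘high’ class is predicted almost perfectly while ‘low’ and ‘med_low’ are often confused with each other.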
I calculated the total within-cluster sum of squares (TWCSS) for different numbers of clusters and plotted the results to find the elbow point. To visualize the clusters, I ran the k-means algorithm again with the optimal number of clusters and plotted them with the pairs() function, which shows the clusters in a scatterplot matrix. The color of each point represents its cluster membership: each point corresponds to an observation, and the grouping is based on the k-means clustering. The clustering helps us identify groups of observations that are similar in terms of the standardized variables. We can observe that observations within each cluster are similar to each other and that there is clear separation between clusters.
data("Boston")
# Standardize the dataset
boston_scaled <- scale(Boston)
# Calculate distances between observations
distances <- dist(boston_scaled)
k_means_result <- kmeans(boston_scaled, centers = 3)
k_max <- 10
twcss <- sapply(1:k_max, function(k) kmeans(boston_scaled, centers = k)$tot.withinss)
plot(1:k_max, twcss, type = 'b', main = 'Total WCSS vs. Number of Clusters', xlab = 'Number of Clusters', ylab = 'Total WCSS')
# Run k-means again with the optimal number of clusters
optimal_k <- 3
k_means_optimal <- kmeans(boston_scaled, centers = optimal_k)
# Visualize the clusters with pairs() function, using first 5 variables
pairs(boston_scaled[, 1:5], col = k_means_optimal$cluster)
Bonus
# k-means clustering
km <- kmeans(Boston, centers = 4)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
set.seed(123)
# determine the number of clusters
k_max <- 4
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
km <- kmeans(Boston, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
Super Bonus
Given code in the exercise:
library(plotly)
## Warning: package 'plotly' was built under R version 4.3.2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
Modified code:
The dimensions of the data were causing the issue: matrix_product has 404 rows, but a color vector from clustering the full Boston data has length 506. The vectors used for x, y, and z must have the same length as the color vector, so I color by the crime classes of the training set instead.
library(plotly)
#set.seed(123)
#k_max <- 3
#km <- kmeans(Boston, centers = k_max)
model_predictors <- dplyr::select(train, -crime)
# Matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
# Check the dimensions
dim(matrix_product)
## [1] 404 3
length(train$crime)
## [1] 404
# Plot 3D scatter plot with color based on crime classes
plot_ly(data = matrix_product, x = ~LD1, y = ~LD2, z = ~LD3,
type = 'scatter3d', mode = 'markers', color = as.factor(train$crime),
marker = list(size = 5)) # Adjust marker size as needed
Analysis Exercise
library(dplyr)
library(ggplot2)
library(GGally)
library(readr)
library(corrplot)
library(tibble)
library(tidyr)
#mine
human <- read_csv("data/human2.csv", show_col_types = FALSE)
#professor
#human <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.csv", show_col_types = FALSE)
Point 1
#%keep <- c("Country", "Edu2.FM", "Labo.FM", "Life.Exp", "Edu.Exp", "GNI", "Mat.Mor", "Ado.Birth", "Parli.F")
#human <- select(human, one_of(keep))
#human <- filter(human, complete.cases(human))
#last <- nrow(human) - 7
#human <- human[1:last, ]
#move country name to rownames
human_ <- column_to_rownames(human, "Country")
str(human_)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ GNI : num 64992 42261 56431 44025 45435 ...
## $ Mat.Mor : num 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(human_)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
# visualize the 'human_' variables
ggpairs(human_, progress = FALSE)
# compute the correlation matrix and visualize it with corrplot
cor(human_) %>% corrplot()
#ggplot(human,aes(Life.Exp,GNI, label=Country)) + geom_point() + geom_text() + theme_bw()
After moving the country names to rownames, we can show a summary of our data. I used ggpairs and corrplot to see the relationships between the variables. We can observe that the variables correlate with each other, and the plots help us see the central tendency, variability, and distribution of each variable in the dataset.
The R console gives us the range of values, the central tendency (mean, median), and the spread of the data for each variable. For example, for the variable “Edu2.FM” the minimum value is 0.1717 and the maximum is 1.4967. The median is 0.9375, which indicates that the middle value of the data is close to this point.
From the plots we can conclude that several of our variables are very highly correlated.
The colors in the corrplot represent the strength and direction of the correlation between pairs of variables, as shown in the scale (values around zero mean little or no correlation). The correlations are strongest for the circles with more intense color and larger diameter. Blue circles represent positive correlations: for example, if Edu.Exp increases then Life.Exp also tends to increase. On the other hand, red circles indicate negative correlations: for example, if Life.Exp increases then Mat.Mor tends to decrease.
Point 2 - PCA without standardizing
human <- column_to_rownames(human, "Country")
summary(human)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
# perform principal component analysis
pca_human <- prcomp(human_)
biplot(pca_human, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
summary(pca_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
#s <- summary(pca_human)
#pca_pr <- round(1*s$importance[2, ], digits = 5)
#print(pca_pr)
#paste0(names(pca_pr), " (", pca_pr, "%)")
#pc_lab = paste0(names(pca_pr), " (", pca_pr, "%)")
#biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1] , ylab = pc_lab[2])
Point 3 and 4 - PCA with standardizing and interpretation
#human <- column_to_rownames(human, "Country")
human_std <- scale(human)
pca_human <- prcomp(human_std)
# create and print out a summary of pca_human
s <- summary(pca_human)
# rounded percentages of variance captured by each PC
# (multiply the proportions by 100 so the labels really are percentages)
pca_pr <- round(100*s$importance[2, ], digits = 1)
# print out the percentages of variance
print(pca_pr)
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3 
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
pc_lab
## [1] "PC1 (53.6%)" "PC2 (16.2%)" "PC3 (9.6%)"  "PC4 (7.6%)"  "PC5 (5.5%)" 
## [6] "PC6 (3.6%)"  "PC7 (2.6%)"  "PC8 (1.3%)" 
# draw a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1] , ylab = pc_lab[2])
We can see that without standardizing the variables, the PCA is not very useful. The principal components maximize variance, and in the biplot of the unstandardized variables the axes are dominated by GNI, whose values are so large that the data are not comparable: next to GNI, the variables with small values contribute almost no variance. This is why it is so important to put the variables on the same scale (standardization), as we did with the function scale(), so that they can be compared on an equal footing in the same analysis.
By doing the PCA we can see how important the principal components are and whether our variables are highly correlated. The scales above and to the right refer to the original values of the variables, roughly in the range -10 to 10. The variance is maximized along the horizontal axis (PC1). We can conclude that Labo.FM and Parli.F are related to the second principal component; they are quite independent of the other variables, which load on the PC1 axis.
From the PCA we can conclude that about 53.6% of the variance in the dataset can be summarized by one component (PC1), and another 16.2% by a second component (PC2). Our variables Parli.F and Labo.FM are highly correlated with each other and explain around 16% of the variance of our original dataset.
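The effect of skipping standardization can be reproduced on toy data (a sketch with made-up variables, not the human data): when one variable's variance dwarfs the others, PC1 is essentially that variable alone.

```r
# Unscaled PCA: the high-variance column dominates PC1,
# mirroring what GNI did in the biplot above.
set.seed(1)
d <- data.frame(small = rnorm(100), big = rnorm(100, sd = 1000))
p <- prcomp(d)                      # no scaling
summary(p)$importance[2, 1]         # PC1 proportion of variance, close to 1
```

After `prcomp(d, scale. = TRUE)` (or scaling the data first), the two components would instead split the variance roughly evenly.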
Point 5
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.3.2
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
View(tea)
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- select(tea, one_of(keep_columns))
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# visualize the dataset
pivot_longer(tea_time, cols = everything()) %>%
ggplot(aes(value)) +
facet_wrap("name", scales = "free") +
geom_bar() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
# MCA (Multiple Correspondence Analysis); the MCA() function comes from FactoMineR
library(FactoMineR)
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")
The contributions of the variable categories (in %) to the definition of the dimensions can be read in the R console from the last table of the summary output, which lists the first three dimensions. The dimensions (Dim.1, Dim.2, Dim.3) capture different patterns or structures in the categorical data, representing relationships or associations between the categories of the variables. For example, in Dim.1 the variables "how" and "where" have relatively high eta2 values (0.708 and 0.702, respectively), suggesting that these variables are the most important in explaining the pattern represented by Dim.1.
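Instead of reading these values off the printed summary, they can also be extracted programmatically from the MCA object. A small sketch, assuming `mca` is the FactoMineR result computed above:

```r
library(FactoMineR)

# Contribution (%) of each category to the first three dimensions
round(mca$var$contrib[, 1:3], 3)

# Squared correlation ratio (eta2) of each variable with each dimension;
# this is the table used in the interpretation above
round(mca$var$eta2[, 1:3], 3)
```

Having these as matrices makes it easy to sort the categories by contribution, e.g. `sort(mca$var$contrib[, 1], decreasing = TRUE)` for Dim.1.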
The bar plots suggest that most individuals prefer to drink their tea alone and outside lunch hours. Tea bags are the most common format, Earl Grey is the most popular variety, and most people buy their tea in chain stores. The split between individuals who take sugar and those who do not is nearly even (145 vs. 155).
plot.MCA() generates a variable biplot that displays the relationships between the variable categories, helping us identify the categories most correlated with each dimension.
The axes of the plot represent the dimensions (factors) extracted by the MCA: Dimension 1 is on the horizontal axis and Dimension 2 on the vertical axis. The percentage of variability explained by each dimension is also given: 15.24% for Dimension 1 and 14.23% for Dimension 2. The distance of a point from the origin reflects the strength of its association with the dimensions; categories farther from the origin have a stronger influence on the variability in the data. Along Dimension 1, 'tea shop' and 'unpackaged' lie furthest from the origin and therefore carry the most importance (they are the most correlated with Dimension 1).
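The same interpretation can be made easier to see by coloring the categories by their contribution. A sketch, assuming the optional factoextra package is installed (it is not loaded elsewhere in this report):

```r
library(factoextra)

# Variable-category map of the MCA, colored by contribution to the
# displayed dimensions; high-contribution categories such as
# 'tea shop' and 'unpackaged' stand out at the warm end of the scale
fviz_mca_var(mca,
             col.var = "contrib",                          # color by contribution
             gradient.cols = c("blue", "orange", "red"),   # low -> high
             repel = TRUE)                                 # avoid label overlap
```

This reproduces the plot(mca, ...) map above with an explicit contribution gradient rather than uniform category colors.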